class: center, middle, inverse, title-slide # Comparing microbial communities with OptiFit: a fast method for fitting sequences to existing OTUs ### Kelly Sovacool ### 2021-03-11 --- background-image: url("data:image/png;base64,#../../figures/SchlossLab.png") background-size: contain background-color: "#000000" class: inverse ??? Sarah Westcott, software engineer. implemented algorithms I'm discussing today. --- ## Role of the microbiome in human health .center[ <img src="data:image/png;base64,#https://raw.githubusercontent.com/kelly-sovacool/bioinf603-fall2019/master/docs/figures/human_microbiome.gif" height="500"/> ] --- ## The gut microbiome changes in colorectal cancer .pull-left[ - Colorectal cancer patients have an altered gut microbiome compared to healthy individuals. - Fecal matter transplants from cancerous mice increase tumor formation in germ-free mice. - Machine learning models trained on microbiome data to predict CRC have modest performance (mean AUROC ~0.7) ] .pull-right[  ] .footnote[ [Zackular et al. (2013) _mBio_](doi.org/10.1128/mBio.00692-13) ] --- ## Characterizing microbial communities .pull-left[ <img src="data:image/png;base64,#../../figures/omics-cascade.jpg" height=500/> ] -- .pull-right[ - Genomics: Which microbes are there & what are they capable of doing - Transcriptomics: What are they trying to do - Proteomics: What are they actually doing - Metabolomics: How are they affecting the host & each other .footnote[ [Haukaas et al. (2017) _Metabolites_](doi.org/10.3390/metabo7020018) ] ] --- ## Sequencing techniques | | Metagenomics | Amplicon Sequencing | |------|--------------|----------| | **Technique** | Whole-metagenome shotgun sequencing: every gene of every microbe | Sequence just 250bp of the 16S rRNA gene | | **Cost** | 💲💲💲 | 💲 | | **Taxonomic resolution** | Species or strain | Genus | | **Organisms** | Bacteria, archaea, viruses, fungi | Bacteria & archaea; or fungi separately | | **Purpose** | Identify which microbial genes are present; assemble microbial genomes | Identify which microbes are present and their abundances | ??? metagenomics is prohibitively expensive for very large datasets --- ## Operational Taxonomic Units (OTUs) - Cluster similar sequences together into OTUs as a proxy for genus/species. - OTUs clustered at 3% distance (97% sequence similarity) generally can be labelled at the genus level. .center[ <img src="data:image/png;base64,#https://raw.githubusercontent.com/kelly-sovacool/bioinf603-fall2019/master/docs/figures/OTU_clustering.png" height="400" /> ] .pull-right[Image credit: M. Brodie Mumphrey] --- ## Why cluster sequences into OTUs? - Bacterial taxonomy is imperfect. Some species are overly broad, others are too narrow. - e.g. some strains of _E. coli_ share only 20% of their genomes. - Many bacteria have multiple copies of the rRNA gene. - Pick a sequence distance threshold to minimize splitting sequences from the same genome into different OTUs. - Bacteria evolve rapidly. Don't want to exclude novel microbes from an analysis. ??? Bacterial taxonomy is hard Goal of classification is to reflect evolutionary history. Challenges: - Asexual reproduction - Fast evolutionary change - Horizontal gene transfer Current classifications are far from perfect. Courtney's resolution figure? --- ## Evaluating OTU Quality Pairs of sequences that are under the distance threshold should be assigned to the same OTU; sequences above the threshold should be assigned to different OTUs. Confusion Matrix: | | Actual Positive | Actual Negative | |---|---|---| | Predicted Positive | TP | FP | | Predicted Negative | FN | TP | -- Matthews Correlation Coefficient: `$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$` Range of MCC: - 1 - perfect prediction - 0 - random - -1 - completely wrong --- ## OTU Clustering methods - _De novo_ - Cluster sequences based on their similarity to each other. - Limitation: OTUs aren't stable; you'll get different OTU assignments when new sequences are added. - Reference-based - Cluster sequences based on their similarity to a reference database. - Limitation: novel sequences won't be represented. OTUs are limited by the diversity represented in the database. .center[ <img src="data:image/png;base64,#../../figures/schloss-2016-msystems_all-method-comparison.png" height="350"/> ] ??? _De novo_ methods out-perform reference-based methods. .footnote[ [Westcott & Schloss (2015) _PeerJ_ ](https://doi.org/10.7717%2Fpeerj.1487); [Schloss (2016) _mSystems_ ](https://doi.org/10.1128%2FmSystems.00027-16); [Westcott & Schloss (2017) _mSphere_](10.1128/mSphereDirect.00073-17) ] --- ## OptiClust: _de novo_ OTUs that optimize the MCC 1. Calculate all pairwise distance scores. .center[ ``` seqA AGGGTACG seqB ACG-TACG ``` `$$distance = 1 - sequence\ similarity = 1 - \frac{number\ of\ matches}{sequence\ length}$$` ] 1. Begin with each sequence assigned to its own OTU. 1. For each sequence, consider whether to move it to a different OTU or not. Do whichever option results in the best MCC score. Repeat until the MCC doesn't change. .center[ <img src="data:image/png;base64,#../../figures/opticlust-supp-ex.png" width="700"/> ] .footnote[ <img src="data:image/png;base64,#https://raw.githubusercontent.com/mothur/logo/master/mothur_RGB.png" height="150"/> ] ??? if equally good, pick one at random. opticlust is great, but still has limitations of _de novo_ clustering: OTU assignments aren't stable when new sequences are added. --- ## OptiFit: assign new sequences to existing OTUs 1. Start with OTUs that were previously clustered _de novo_ to use as a reference. 1. Calculate pairwise sequence distance scores between all reference and query sequences. 1. Begin with each query sequence randomly assigned to a reference OTU. 1. For each query sequence, consider whether to move it to a different OTU or not. Do whichever option results in the best MCC score. Repeat until the MCC doesn't change. .pull-left[ <img src="data:image/png;base64,#https://raw.githubusercontent.com/mothur/logo/master/mothur_RGB.png" width="200"/> ] --- ## Reference-based strategies - Modes - Closed - sequences that can't be fit to the reference are thrown out. - Open - sequences that can't be fit to the reference are then clustered _de novo_ to create additional OTUs. - Reference choice - External, independent database (SILVA, greengenes, RDP). - Split the dataset into a reference & query set. --- ## Reference clustering against a database Questions - Which database should we use? - SILVA, Greengenes, Ribosomal database project - How does the performance of fitting datasets to independent databases compare to _de novo_ clustering? - Are the results consistent across datasets from a variety of sources? - Human gut, mouse gut, marine, soil - How does mothur compare to vsearch? --- ## Reference clustering against a database .pull-left[ <img src='../../exploratory/2021/figures/mcc_fit-db-1.png' /> ] -- .pull-right[ <img src='../../exploratory/2021/figures/fraction-mapped_fit-db-1.png' /> ] ??? - _De novo_ and open-reference perform equally well. - Closed reference performs slightly worse. - RDP doesn't perform well. - Prior to switching to silva v132 & updating mothur, closed-reference artificially outperformed open-reference. --- ## Reference clustering against a database .pull-left[ <img src='../../exploratory/2021/figures/runtime_fit-db-1.png' /> ] -- .pull-right[ <img src='../../exploratory/2021/figures/memory_fit-db-1.png' /> ] --- ## mothur vs. vsearch <img src='../../exploratory/2021/figures/metrics_all-human-1.png' /> ??? - Just showing the human dataset here because results patterns were basically the same across all datasets. - Only used greengenes for reference because that's what QIIME does. --- ## Splitting datasets into reference & query sets Questions - What fraction of sequences should be in the reference? - How to select which sequences to be in the reference? - Simple random sample - Weight by sequence abundance - Weight by connectivity (number of pairs within the distance threshold) - How does the performance of fitting datasets to reference fractions compare to de novo clustering? - Are the results consistent across datasets from a variety of sources? - Human gut, mouse gut, marine, soil --- ## Splitting datasets into reference & query sets <img src='../../exploratory/2020-05_May-Oct/figures/mcc_complement-4.png' /> --- ## Splitting datasets into reference & query sets <img src='../../exploratory/2021/figures/fraction-mapped_fit-split-1.png' /> --- ## Splitting datasets into reference & query sets <img src='../../exploratory/2021/figures/runtime_fit-split-1.png' /> --- ## Summarizing mothur results <img src='../../exploratory/2021/figures/mcc_all-opti-1.png' /> ??? Whenever there's virtually no variation in the metric across random seeds. --- ## Conclusions - If you need to assign new sequences to existing OTUs, OptiFit results in OTUs of similar quality as OptiClust. - _De novo_ clustering still usually outperforms reference clustering against an external database. - VSEARCH runs faster but produces lower quality OTUs than OptiFit or OptiClust. - OptiFit runs slower than OptiClust in some cases; continue using OptiClust if stable OTUs aren't required. -- ## What's next - Demonstrate that OptiFit works well for machine learning. - Integrate OTUs with other microbial omics data. --- ## Managing Bioinformatics Pipelines .center[ <img src="data:image/png;base64,#https://snakemake2019.readthedocs.io/en/latest/_images/workflow_management.png" width="750"/> ] .footnote[ [snakemake.readthedocs.io](https://snakemake.readthedocs.io) ] ??? I ran about 20,000 jobs on the GreatLakes HPC --- ## Pipelines with <img src="data:image/png;base64,#https://raw.githubusercontent.com/snakemake/snakemake/f55801bfabfb916f77fa4ec8f9eb0667e5c8a92e/images/biglogo.png" height="35"/> .center[ <img src="data:image/png;base64,#https://raw.githubusercontent.com/snakemake/snakemake/f55801bfabfb916f77fa4ec8f9eb0667e5c8a92e/docs/project_info/img/idea.png" width="1000" /> ] .footnote[ [snakemake.readthedocs.io](https://snakemake.readthedocs.io) ] --- ## Tools for Reproducible Bioinformatics | package mgmt | scripts | reports | workflow | version control | |---|---|---|---|---| | <img src="data:image/png;base64,#https://docs.conda.io/en/latest/_images/conda_logo.svg" width="175"/> | <img src="data:image/png;base64,#https://longhowlam.files.wordpress.com/2017/04/rp.png" height="175"/> | <img src="data:image/png;base64,#https://ulyngs.github.io/rmarkdown-workshop-2019/slides/figures/rmarkdown.png" height="165"/> | <img src="data:image/png;base64,#https://avatars.githubusercontent.com/u/33450111?s=400&v=4" height="175"> |<img src="https://miro.medium.com/max/2625/1*8fPMdk2Cd5iJQ7dI7jXCbA.jpeg" height="175"/> | -- .center[ ### ...and headache prevention! ]